
    Audio Indexing on the Web: a Preliminary Study of Some Audio Descriptors

    International peer-reviewed conference paper with proceedings. International audience. The "Invisible Web" consists of documents that Web search engines cannot currently access, either because they have a dynamic URL or because they are not textual, such as video or audio documents. For audio documents, one solution is automatic indexing: finding good descriptors of audio documents that can serve as indexes for archiving and search. This paper presents an overview and recent results of the RAIVES project, a French research project on audio indexing. We present speech/music segmentation, speaker tracking, and keyword detection. We also outline some perspectives for the RAIVES project.

    RAIVES Project (Recherche Automatique d'Informations Verbales Et Sonores): Toward the Extraction and Structuring of Radio Data on the Internet

    Contract report. The Internet has become an important vector of communication. It enables the dissemination and exchange of a growing volume of data. The task is therefore no longer merely to collect large masses of "electronic information", but above all to catalogue and classify it so as to facilitate access to useful information. A piece of information, however important, remains unknown if it sits on an unindexed site. The share of the "invisible Web" must therefore not be neglected. The invisible Web can be defined as the set of unindexed information, either because it is not catalogued, or because the pages containing it are dynamic, or because its nature makes it impossible or difficult to index. Indeed, most search engines rely on a textual analysis of page content and cannot take into account the content of audio or visual documents. A set of content descriptors must therefore be provided to structure documents so that their information becomes accessible to search engines. For audio documents, the goal of our project is thus, on the one hand, to extract this information and, on the other hand, to structure the documents in order to facilitate access to their content. Content-based indexing of audio documents relies on techniques used in automatic speech processing, but must be distinguished from the automatic alignment of a text with an audio stream and from automatic speech recognition: those would reduce the content of an audio document to its verbal component alone. Yet the non-verbal component of an audio document is important and often reflects a particular structure of the document. For example, in radio documents one observes an alternation of speech and music, more specifically jingles, that announce the news.
Thus, we can consider a set of descriptors of the content of a radio document: speech/music segments, "key sounds", language, speaker changes together with a possible identification of these speakers, keywords, and topics. This set can of course be extended. Extracting the full set of descriptors is probably sufficient to reference a document on the Internet, but it is worthwhile to go further and give access to specific parts of the document. Each descriptor must therefore be associated with a time marker that gives direct access to the information. However, since the descriptors belong to different levels of description, their organization is not linear in time: the same speaker may speak two languages within the same speech segment, or several speakers may intervene within a speech segment in a given language. It is therefore also necessary to structure the information across several levels of representation.
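
The multi-level, time-stamped organization of descriptors can be made concrete with a small data structure. The following is a hypothetical sketch, not the RAIVES implementation: the tier names, the `Descriptor` and `AudioIndex` classes, and the half-open `[start, end)` time convention are all assumptions. Each descriptor carries its own time marker, and tiers are stored independently so they can overlap freely in time.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Descriptor:
    """One time-stamped descriptor on a given description level (tier).

    Hypothetical structure for illustration; tier names are assumptions.
    """
    tier: str     # e.g. "speech_music", "speaker", "language", "keyword"
    label: str    # e.g. "speech", "spk1", "fr", "election"
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive


@dataclass
class AudioIndex:
    """Descriptors grouped by tier; different tiers may overlap in time."""
    tiers: Dict[str, List[Descriptor]] = field(default_factory=dict)

    def add(self, d: Descriptor) -> None:
        if d.end <= d.start:
            raise ValueError("descriptor must have a positive duration")
        self.tiers.setdefault(d.tier, []).append(d)

    def at(self, t: float) -> Dict[str, List[str]]:
        """For each tier, the labels active at time t (start <= t < end)."""
        result: Dict[str, List[str]] = {}
        for tier, items in self.tiers.items():
            labels = [d.label for d in items if d.start <= t < d.end]
            if labels:
                result[tier] = labels
        return result

    def find(self, tier: str, label: str) -> List[float]:
        """Sorted time markers (start times) of every occurrence of a label."""
        return sorted(d.start for d in self.tiers.get(tier, []) if d.label == label)
```

For instance, with a `speech` segment from 5 s to 60 s, speaker `spk1` from 5 s to 30 s, and languages `fr` (5 to 20 s) then `en` (20 to 60 s), the same speaker spans two languages inside one speech segment, which a single linear timeline could not represent; `index.at(25.0)` returns the `speech`, `spk1`, and `en` labels, one per tier.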

    StoViz: Story Visualization of TV Series

    International audience. Recent TV series tend to have increasingly complex plots. They follow the lives of numerous characters and are made of multiple intertwined stories. In this paper, we introduce StoViz, a web-based interface that provides a fast overview of this kind of episode structure, based on our plot de-interlacing system. StoViz has two main goals. First, it gives the user a useful overview of the episode by displaying each story separately, together with a short summary extracted from it. Second, it allows an efficient visual comparison between the output of any automatic plot de-interlacing algorithm and the manual story annotation, which makes it very helpful for evaluation purposes. StoViz is available online at http://stoviz.niderb.fr

    Toward Plot De-Interlacing in TV Series using Scenes Clustering

    International audience. Multiple sub-stories usually coexist in every episode of a TV series. We propose several variants of an approach to plot de-interlacing based on scene clustering, with the ultimate goal of providing end users with tools for a fast and easy overview of one episode, one season, or a whole TV series. Each scene can be described in three different ways (based on color histograms, speaker diarization, or automatic speech recognition outputs), and four clustering approaches are investigated, one of them based on a graphical representation of the video. Experiments are performed on two TV series of different lengths and formats. We show that semantic descriptors (such as speaker diarization) give the best results, and that our approach provides useful information for plot de-interlacing.
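
The abstract only states that one scene description is "based on color histograms". As a hedged illustration of what such a descriptor can look like, the sketch below averages per-frame joint RGB histograms over a scene and compares scenes with histogram intersection. The bin count (4 levels per channel), the averaging over frames, and the intersection measure are all assumptions, not the paper's stated choices; frames are plain lists of 8-bit RGB tuples to keep the example dependency-free.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB values in 0..255


def frame_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Joint RGB histogram of one frame, `bins` levels per channel, summing to 1."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins-1.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[(ri * bins + gi) * bins + bi] += 1.0
    total = float(len(pixels)) or 1.0  # avoid dividing by zero on empty frames
    return [h / total for h in hist]


def scene_histogram(frames: Sequence[Sequence[Pixel]], bins: int = 4) -> List[float]:
    """Scene descriptor (assumption): mean of the frame histograms."""
    if not frames:
        raise ValueError("a scene needs at least one frame")
    acc = [0.0] * (bins ** 3)
    for frame in frames:
        for i, v in enumerate(frame_histogram(frame, bins)):
            acc[i] += v
    return [v / len(frames) for v in acc]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1] between two normalized histograms (1 = identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A similarity matrix built with `histogram_intersection` over all scene pairs can then feed any of the clustering approaches; only the descriptor changes between modalities.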

    Hierarchical Framework for Plot De-interlacing of TV Series based on Speakers, Dialogues and Images

    International audience. Since the 1990s, TV series have tended to introduce more and more main characters and are often composed of multiple intertwined stories. In this paper, we propose a hierarchical plot de-interlacing framework that clusters semantic scenes into stories: a story is a group of scenes that are not necessarily contiguous but show a strong semantic relation. Each scene is described using three different modalities (based on color histograms, speaker diarization, or automatic speech recognition outputs) as well as their multimodal combination. We introduce the notion of character-driven episodes, in which stories are emphasized by the presence or absence of characters, and we propose an automatic method, based on a social graph, to detect these episodes. Depending on whether an episode is character-driven or not, plot de-interlacing (which is a scene clustering) is performed either through a traditional average-link agglomerative clustering using the speaker modality only, or through a spectral clustering using the fusion of all modalities. Experiments conducted on twenty-three episodes from three quite different TV series (with different lengths and formats) show that the hierarchical framework brings an improvement for all the series.
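
The character-driven branch uses average-link agglomerative clustering on the speaker modality. A minimal sketch of that technique follows, assuming each scene is summarized as a speaker-to-speaking-time mapping and compared with cosine distance; both the representation and the stopping threshold are assumptions, since the abstract does not specify them. Clusters keep merging while the smallest average pairwise distance between two clusters stays below the threshold.

```python
import math
from typing import Dict, List


def cosine_distance(u: Dict[str, float], v: Dict[str, float]) -> float:
    """1 - cosine similarity between two sparse speaker-time vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 1.0  # a scene without speech is maximally distant
    return 1.0 - dot / (nu * nv)


def average_link_clustering(scenes: List[Dict[str, float]], threshold: float) -> List[List[int]]:
    """Group scene indices into stories with average-link agglomerative clustering."""
    clusters: List[List[int]] = [[i] for i in range(len(scenes))]
    dist = [[cosine_distance(a, b) for b in scenes] for a in scenes]

    def avg_link(c1: List[int], c2: List[int]) -> float:
        # Average of all pairwise scene distances between the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg_link(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]
```

Because stories are sets of not necessarily contiguous scenes, the clustering ignores temporal order entirely: scenes 0 and 2 can form one story while scenes 1 and 3 form another.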

    Face-and-Clothing Based People Clustering in Video Content

    International audience. Content-based people clustering is a crucial step for people indexing within video documents. In this paper, we investigate the use of both face and clothing features. We propose a method for extracting a keyface for each video sequence. Face matching is addressed with an algorithm based on the average of the N minimum pair distances between local invariant features. We also propose an original clothing-matching method based on a 3D histogram of the dominant color. Finally, we describe a three-level hierarchical bottom-up clustering that combines local invariant features, skin color, the 3D histogram, and clothing texture. Experimental results show the efficiency of the proposed clustering system.
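
The face-matching score described above, the average of the N minimum pair distances between local invariant features, can be sketched directly. Assumptions: each keyface is a list of local descriptor vectors (e.g. SIFT-like features, whose extraction is out of scope here), pairs are compared with Euclidean distance, and `n` is a free parameter; the abstract fixes none of these. `heapq.nsmallest` returns all pairs if fewer than `n` exist.

```python
import heapq
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def n_min_pair_distance(face_a: List[Vector], face_b: List[Vector], n: int = 3) -> float:
    """Average of the n smallest distances over all descriptor pairs (a in A, b in B).

    Lower values mean the two keyfaces are more likely the same person.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not face_a or not face_b:
        raise ValueError("both faces need at least one local descriptor")
    distances = (euclidean(a, b) for a in face_a for b in face_b)
    smallest = heapq.nsmallest(n, distances)
    return sum(smallest) / len(smallest)
```

Averaging only the closest pairs makes the score robust to descriptors that have no counterpart in the other face (occlusion, pose changes), since unmatched features never reach the N smallest distances.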

    Audiovisual diarization of people in video content

    International audience. Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. Following a review of the literature on people diarization for audio and video content and of its limitations, including our own contributions, we describe a method that associates audio and video information using co-occurrence matrices, and we present experiments conducted on a corpus containing TV news, TV debates, and movies. Results show the effectiveness of the overall diarization system and confirm the gains that audio information can bring to video indexing, and vice versa.
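
A co-occurrence matrix between audio and video clusters can be built from the temporal overlap of their segments. The sketch below is a hypothetical illustration of that idea, not the AVPD method itself: it assumes each diarization output is a list of `(cluster label, start, end)` segments, weights co-occurrence by overlap duration in seconds, and pairs clusters greedily one-to-one from the largest co-occurrence down. The paper's actual association rule is not given in the abstract.

```python
from typing import Dict, List, Set, Tuple

Segment = Tuple[str, float, float]  # (cluster label, start s, end s)


def cooccurrence(audio: List[Segment], video: List[Segment]) -> Dict[Tuple[str, str], float]:
    """Total overlap duration for every (audio cluster, video cluster) pair."""
    matrix: Dict[Tuple[str, str], float] = {}
    for a_lab, a_start, a_end in audio:
        for v_lab, v_start, v_end in video:
            overlap = min(a_end, v_end) - max(a_start, v_start)
            if overlap > 0:
                key = (a_lab, v_lab)
                matrix[key] = matrix.get(key, 0.0) + overlap
    return matrix


def associate(matrix: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Greedy one-to-one pairing (assumption), largest co-occurrence first."""
    pairs: Dict[str, str] = {}
    used_video: Set[str] = set()
    for (a_lab, v_lab), _ in sorted(matrix.items(), key=lambda kv: kv[1], reverse=True):
        if a_lab not in pairs and v_lab not in used_video:
            pairs[a_lab] = v_lab
            used_video.add(v_lab)
    return pairs
```

Once an audio speaker cluster is paired with a face cluster, either side can correct the other: for example, two face clusters that both co-occur mostly with one speaker are candidates for merging.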

    Toward Automatic Summarization of TV Series Based on Multimodal Story Search

    Modern TV series have complex plots made of several intertwined stories following numerous characters. In this paper, we propose an approach for automatically detecting these stories in order to generate video summaries, and we propose a visualization tool that gives a quick and easy overview of TV series. Starting from an automatic segmentation of each episode into scenes (a scene is defined as temporally and spatially continuous and semantically coherent), scenes are clustered into stories made of semantically similar, not necessarily adjacent, scenes. Visual, audio, and text modalities are combined to improve scene segmentation and story detection. Salient scenes are then extracted from the stories to create the summary. Experiments are conducted on two TV series with different formats.

    Segmenting TV Series into Scenes using Speaker Diarization

    International audience. In this paper, we propose a novel approach to scene segmentation of TV series. Using the output of our existing speaker diarization system, any temporal segment of the video can be described as a binary feature vector. A straightforward segmentation algorithm then groups similar contiguous speaker segments into scenes. An additional visual-only, color-based segmentation is used to refine this first segmentation. Experiments are performed on a subset of the Ally McBeal TV series and show promising results, obtained with a rule-free and generic method. For comparison purposes, the test corpus annotations and description are made available to the community.
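
The key idea, describing any temporal segment as a binary vector of which speakers talk in it, can be sketched as follows. This is one plausible reading under stated assumptions, not the paper's algorithm: candidate boundaries are tested every `step` seconds, the speakers in the `window` seconds before and after are compared with Jaccard similarity, and a boundary is placed where similarity drops below `threshold`. The window, step, similarity measure, and threshold are all hypothetical parameters.

```python
from typing import List, Sequence, Tuple

Turn = Tuple[str, float, float]  # (speaker label, start s, end s) from diarization


def binary_vector(turns: Sequence[Turn], speakers: Sequence[str], start: float, end: float) -> List[int]:
    """1 for each speaker who talks at some point within [start, end), else 0."""
    active = {spk for spk, s, e in turns if s < end and e > start}
    return [1 if spk in active else 0 for spk in speakers]


def jaccard(u: Sequence[int], v: Sequence[int]) -> float:
    """Jaccard similarity of two binary vectors (1.0 when both are empty)."""
    inter = sum(1 for a, b in zip(u, v) if a and b)
    union = sum(1 for a, b in zip(u, v) if a or b)
    return inter / union if union else 1.0


def scene_boundaries(turns: Sequence[Turn], window: float, step: float, threshold: float) -> List[float]:
    """Times t where the speakers of [t - window, t) and [t, t + window) differ strongly."""
    if step <= 0 or window <= 0:
        raise ValueError("window and step must be positive")
    speakers = sorted({spk for spk, _, _ in turns})
    end_time = max(e for _, _, e in turns)
    boundaries: List[float] = []
    t = window
    while t <= end_time - window:  # keep both windows inside the video
        left = binary_vector(turns, speakers, t - window, t)
        right = binary_vector(turns, speakers, t, t + window)
        if jaccard(left, right) < threshold:
            boundaries.append(t)
        t += step
    return boundaries
```

Comparing sets of speakers rather than individual turns is what keeps a dialogue between two characters inside one scene: alternating turns do not change the binary vector of a wide enough window.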